INTERACTIVE CORRELATION OF COMPOUND INFORMATION AND GENOMIC INFORMATION

Field of the invention 

This invention relates to methods and products for identifying pharmaceutical
leads, correlating information regarding gene expression, biological assays and other
relevant information, and facilitating the purchase of related products. 

Background of the Invention 
Genomic sequence information is now available for several organisms, and
additional data is added continuously. However, only a small fraction of the open
reading frames now sequenced correspond to genes of known function: the function 
of most polynucleotide sequences, and any encoded proteins, is still unknown. These 
genes are now studied by means of, inter alia, polynucleotide arrays, which quantify 
the amount of mRNA produced by a test cell (or organism) under specific conditions.
"Chemical genomic annotation" is the process of determining the transcriptional and 
bioassay response of one or more genes to exposure to a particular chemical, and 
defining and interpreting such genes in terms of the classes of chemicals with which
they interact. A comprehensive library of chemical genomic annotations would
enable one to design and optimize new pharmaceutical lead compounds based on the
probable transcriptional and biomolecular profile of a hypothetical compound with 
certain characteristics. Additionally, one can use chemical genomic annotations to 
determine relationships between genes (for example, as members of a signal pathway 
or protein-protein interaction pair), and aid in determining the causes of side effects 
and the like. Finally, presenting the drug design researcher with a body of chemical
genomic annotation information will generate research hypotheses that will stimulate 
follow-on experimental design, and therefore enable and stimulate purchase of related
products to execute such experiments. 

Sabatini et al., US 5,966,712 disclosed a database and system for storing,
comparing and analyzing genomic data.

Maslyn et al., US 5,953,727 disclosed a relational database for storing 
genomic data. 

Kohler et al., US 5,523,208 disclosed a database and method for comparing
polynucleotide sequences and the predicted functions of their encoded proteins. 

Fujimiya et al., US 5,706,498 disclosed a database and retrieval system for
identifying genes of similar sequence.

Summary of the Invention 
We have now invented a system and method for analyzing and exploring the
data resulting from chemical genomic annotation experiments, and for facilitating the
design by a user of further experiments related to the user's goals, and thereby 
encouraging the purchase by the user of products related to the data and additional 
experiments. 

One aspect of the invention is a method for evaluating a test compound for
biological activity, comprising: providing a database comprising a plurality of
reference gene expression profiles, each profile comprising a representation of the
expression level of a plurality of genes in a test cell exposed to a reference compound
and a representation of the reference compound; providing a test gene expression
profile, comprising a representation of the expression level of a plurality of genes in a
test cell exposed to said test compound; comparing said test gene expression profile
with said reference gene expression profiles; identifying at least one reference gene
expression profile that is similar to said test gene expression profile; displaying said
identified expression profile; and displaying product information related to said
identified expression profile.

Another aspect of the invention is a system for performing the method of the 
invention. 

Another aspect of the invention is a computer-readable medium having 
encoded thereon a set of instructions enabling a computer system to perform the 
method of the invention.

Brief Description of the Figures 

Fig. 1 is a diagram of an embodiment of a system of the invention. 
Fig. 2 is a flow diagram illustrating an embodiment of a method of the 
invention.

Detailed Description

Definitions:

The term "test compound" refers in general to a compound to which a test cell 
is exposed, about which one desires to collect data. Typical test compounds will be 
small organic molecules, typically prospective pharmaceutical lead compounds, but 
can include proteins, peptides, polynucleotides, heterologous genes (in expression 
systems), plasmids, polynucleotide analogs, peptide analogs, lipids, carbohydrates,
viruses, phage, parasites, and the like. 

The term "biological activity" as used herein refers to the ability of a test
compound to alter the expression of one or more genes.

The term "test cell" refers to a biological system or a model of a biological 
system capable of reacting to the presence of a test compound, typically a eukaryotic
cell or tissue sample, or a prokaryotic organism. 

The term "gene expression profile" refers to a representation of the expression 
level of a plurality of genes in response to a selected expression condition (for 
example, incubation in the presence of a standard compound or test compound). 
Gene expression profiles can be expressed in terms of an absolute quantity of mRNA
transcribed for each gene, as a ratio of mRNA transcribed in a test cell as compared 
with a control cell, and the like. As used herein, a "standard" gene expression profile 
refers to a profile already present in the primary database (for example, a profile 
obtained by incubation of a test cell with a standard compound, such as a drug of 
known activity), while a "test" gene expression profile refers to a profile generated
under the conditions being investigated. The term "modulated" refers to an alteration 
in the expression level (induction or repression) to a measurable or detectable degree, 
as compared to a pre-established standard (for example, the expression level of a 
selected tissue or cell type at a selected phase under selected conditions). 
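
The ratio form of a gene expression profile, and the notion of a "modulated" gene, can be illustrated with a minimal sketch. The function names, the gene identifiers, and the two-fold cutoff are assumptions chosen for the example; they are not values taken from this disclosure.

```python
import math


def log_ratio_profile(test_levels, control_levels):
    """Return {gene: log2(test/control)} for genes measured in both cells."""
    profile = {}
    for gene, test in test_levels.items():
        control = control_levels.get(gene)
        if control is None or control <= 0 or test <= 0:
            continue  # skip genes without a usable measurement
        profile[gene] = math.log2(test / control)
    return profile


def classify_modulation(profile, cutoff=1.0):
    """Label each gene as induced, repressed, or unchanged.

    cutoff is an assumed threshold on |log2 ratio| (1.0 means two-fold).
    """
    labels = {}
    for gene, ratio in profile.items():
        if ratio >= cutoff:
            labels[gene] = "induced"
        elif ratio <= -cutoff:
            labels[gene] = "repressed"
        else:
            labels[gene] = "unchanged"
    return labels
```

In practice the pre-established standard could equally be a stored profile or a fixed expression value; only the comparison against a chosen baseline matters for the definition.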

The term "correlation information" as used herein refers to information related
to a set of results. For example, correlation information for a profile result can
comprise a list of similar profiles (profiles in which a plurality of the same genes are 
modulated to a similar degree, or in which related genes are modulated to a similar 
degree), a list of compounds that produce similar profiles, a list of the genes 
modulated in said profile, a list of the diseases and/or disorders in which a plurality of
the same genes are modulated in a similar fashion, and the like. Correlation 
information for a compound-based inquiry can comprise a list of compounds having 
similar physical and chemical properties, compounds having similar shapes, 
compounds having similar biological activities, compounds that produce similar
expression array profiles, and the like. Correlation information for a gene- or
protein-based inquiry can comprise a list of genes or proteins having sequence similarity (at
either nucleotide or amino acid level), genes or proteins having similar known 
functions or activities, genes or proteins subject to modulation or control by the same 
compounds, genes or proteins that belong to the same metabolic or signal pathway,
genes or proteins belonging to similar metabolic or signal pathways, and the like. In
general, correlation information is presented to assist a user in drawing parallels 
between diverse sets of data, enabling the user to create new hypotheses regarding 
gene and/or protein function, compound utility, and the like. Product correlation 
information assists the user with locating products that enable the user to test such
hypotheses, and facilitates their purchase by the user. 

A "hypothesis" as used herein refers to a testable idea, inspired by
correlation information, regarding an explanation or model of gene or protein 
function, biochemical or biological function, drug or compound activity or toxicity, 
absorption, metabolism, distribution, excretion, and the like. Typical hypotheses
herein include, without limitation, the identification of a compound or class of 
compounds as potential lead compounds or drugs, identification of genes or proteins 
that are characteristic of a disease state or adverse reaction, identification of genes 
and/or proteins that interact, and the like. 

"Similar", as used herein, refers to a degree of difference between two
quantities that is within a preselected threshold. For example, two genes can be
considered "similar" if they exhibit sequence identity of more than a given threshold, 
such as for example 20%. A number of methods and systems for evaluating the 
degree of similarity of polynucleotide sequences are publicly available, for example 
BLAST, FASTA, and the like. See also Maslyn et al. and Fujimiya et al., supra,
incorporated herein by reference. The similarity of two profiles can be defined in a 
number of different ways, for example in terms of the number of identical genes 
affected, the degree to which each gene is affected, and the like. Several different 
measures of similarity, or methods of scoring similarity, can be made available to the 
user: for example, one measure of similarity considers each gene that is induced (or
repressed) past a threshold level, and increases the score for each gene in which both
profiles indicate induction (or repression) of that gene. For example, if $g_x$ is gene "x",
$p_{Ex}$ is the expression level of $g_x$ in an experimental profile, $p_{Sx}$ is the expression
level of $g_x$ in a standard profile, and $p_T$ is a predetermined threshold level, we can
define a function $H_x$ for any experimental ("E") and standard ("S") profile pair as $H_x = 1$
when both $p_{Ex}$ and $p_{Sx} > p_T$, and $H_x = 0$ when either $p_{Ex}$ or $p_{Sx} < p_T$. Then, a
simple similarity score can be defined as $N = \sum_x H_x$. This similarity score counts only
the genes that are similarly induced in both profiles. A more informative score can be
calculated as $N' = \sum_x |p_{Ex} - p_{Sx}| \cdot H_x$, which takes into consideration
the difference in expression level between the experimental and standard
profiles, for each gene induced above the threshold level. Other statistical methods 
are also applicable. 
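
The threshold-based scores described above can be sketched as follows. This is an illustrative reading rather than a definitive implementation: profiles are assumed to be dictionaries mapping gene identifiers to expression levels, the function names are hypothetical, and the weighted variant assumes that the per-gene difference is multiplied by the same indicator H used for the simple count.

```python
def indicator(p_e, p_s, threshold):
    """H for one gene: 1 if both levels exceed the threshold, else 0."""
    return 1 if (p_e > threshold and p_s > threshold) else 0


def count_score(experimental, standard, threshold):
    """N: number of genes induced past the threshold in both profiles."""
    shared = experimental.keys() & standard.keys()
    return sum(indicator(experimental[g], standard[g], threshold) for g in shared)


def weighted_difference(experimental, standard, threshold):
    """Sum of |p_E - p_S| over genes induced in both profiles (assumed form)."""
    shared = experimental.keys() & standard.keys()
    return sum(
        abs(experimental[g] - standard[g])
        * indicator(experimental[g], standard[g], threshold)
        for g in shared
    )
```

Under this reading, a larger count indicates more genes induced in common, while a smaller weighted difference indicates that those shared genes are induced to a more similar degree.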

The term "product information" as used herein refers to information regarding 
the availability, characteristics, price, and the like, of a product. Product information 
can consist of a hyperlink to such information. A product "related to data" refers to a 
product useful for the further exploration of the gene, protein, system, and/or 
compound to which the data pertains, or to relationships between the gene, protein, 
system, and/or compound highlighted in the correlation information. Exemplary 
products include, for example, bioassay kits and reagents, compounds useful as 
positive and negative controls, kits for purifying proteins or other biological products, 
antibodies for determining and/or isolating substances, compounds similar to the test 
compound useful for further study, additional data regarding gene or protein function 
and/or relationships (for example, sequence data from other species, information 
regarding metabolic and/or signal pathways to which the gene or protein belongs, and
the like), DNA microarrays useful for determining expression of the gene and/or
related genes, information and analysis regarding features of a compound that are 
likely to be responsible for the observed activity, and the like. 

The term "hyperlink" as used herein refers to a feature of a displayed image or
text that, when activated (for example, by clicking on the hyperlink), provides
information additional and/or related to the information already displayed. An
HTML HREF is an example of a hyperlink within the scope of this invention. For 
example, when a user queries the database of the invention and obtains an output such 
as a list of the genes most induced or repressed by a selected compound, one or more
of the genes listed in the output can be hyperlinked to related information. The 
related information can be, for example, additional information regarding the gene, a 
list of compounds that affect gene induction in a similar way, a list of genes having a 
known related function, a list of bioassays for determining activity of the gene
product, product information regarding such related information, and the like. 
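
As a minimal sketch of hyperlinked output, the following renders a list of gene identifiers as HTML anchors. The base URL is a placeholder and the function names are hypothetical; neither comes from this disclosure.

```python
from html import escape
from urllib.parse import quote


def gene_link(gene_id, base_url="https://example.invalid/gene/"):
    """Return an HTML anchor pointing at a (hypothetical) gene detail page."""
    href = base_url + quote(gene_id, safe="")
    return '<a href="{}">{}</a>'.format(escape(href, quote=True), escape(gene_id))


def render_gene_list(gene_ids):
    """Render a query result as an HTML list of hyperlinked gene names."""
    items = "".join("<li>{}</li>".format(gene_link(g)) for g in gene_ids)
    return "<ul>{}</ul>".format(items)
```

The same pattern could link a gene to compounds, bioassays, or product information by changing the target of the anchor.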



General Method:

The system of the invention provides a correlative database that permits one to 
study relationships between different genes, between genes and a variety of
compounds, to investigate structure-function relationships between different 
compounds, and to facilitate the purchase of products based on such observed 
relationships. The database contains a plurality of standard gene expression profiles, 
which comprise the expression level of a plurality of genes under a plurality of 
specified conditions. The conditions specified can include expression within a
particular cell type (for example, fibroblast, lymphocyte, neuron, oocyte, hepatocyte,
and the like), expression at a particular point in the cell cycle (e.g., G1), expression in
a specified disease state, the presence of environmental factors (for example,
temperature, pressure, CO2 partial pressure, osmotic pressure, shear stress,
confluency, adherence, and the like), the presence of pathogenic organisms (for example,
viruses, bacteria, fungi, and extra- or intracellular parasites), expression in the
presence of heterologous genes, expression in the presence of test compounds, and the 
like, and combinations thereof. The database can contain expression profiles for a 
plurality of different species, for example, human, mouse, rat, chimpanzee, yeast such 
as Saccharomyces cerevisiae, bacteria such as E. coli, and the like. The database
preferably comprises expression profiles for at least 10 different genes from a particular
organism, more preferably in excess of 500 genes, and can include a substantial
fraction of the genes expressed by an organism, such as, for example, about 50%, 
about 75%, about 90%, or essentially 100%. The standard expression profiles are 
preferably annotated, for example, with information regarding the conditions under
which the profile was obtained. Preferably, the database also contains annotations for 
one or more genes, more preferably for each gene represented in the database. The 
annotations can include any available information about the gene, such as, for 
example, the gene's names and synonyms, the gene's nucleotide sequence, the amino
acid sequence encoded, any known biological activity or function, any genes of
similar sequence, any metabolic or protein interaction pathways to which it is known 
to belong, a listing of assays capable of determining the activity of its protein product,
and the like. 

The database contains interpretive gene expression profiles and bioassay
profiles for a plurality of different compounds that comprise a representation of a
compound's mode of action and/or toxicity ("drug signatures"), and can include
experimental compounds and/or "standard" compounds. Drug signatures provide a 
unique picture of a compound's comprehensive activity in vivo, including both its 
effect on gene transcription and its interaction with proteins. Standard compounds are
preferably well-characterized, and preferably exhibit a known biological effect on 
host cells and/or organisms. Standard compounds can advantageously be selected 
from the class of available drug compounds, natural toxins and venoms, known 
poisons, vitamins and nutrients, metabolic byproducts, and the like. The standard 
compounds can be selected to provide, as a set, a wide range of different gene
expression profiles. The records for the standard compounds are preferably annotated
with information available regarding the compounds, such as, for example, the 
compound name, structure and chemical formula, molecular weight, aqueous 
solubility, pH, lipophilicity, known biological activity, source, proteins and/or genes it 
is known to interact with, assays for detecting and/or confirming activity of the
compound or related compounds, and the like. Alternatively, one can employ a 
database constructed from random compounds, combinatorial libraries, and the like. 

The database further contains bioassay data derived from experiments in 
which one or more compounds represented in the database are examined for activity 
against one or more proteins represented in the database. Bioassay data can be
obtained from open literature and directly by experiment. 

Further, the database preferably contains product data related to the 
compounds, genes, proteins, expression profiles, and/or bioassay data otherwise 
present in the database. The product data can be information regarding physical 
products, such as bioassay kits and reagents, compounds useful as positive and
negative controls, compounds similar to the test compound useful for further study,
DNA microarrays and the like, or can comprise information-based products, such as
additional data regarding gene or protein function and/or relationships (for example, 
sequence data from other species, information regarding metabolic and/or signal 
pathways to which the gene or protein belongs, and the like), algorithmic analysis of
the compounds to determine critical features and likely cross-reactivity, and the like. 
The product information can take the form of data or information physically present 
in the database, hyperlinks to external information sources (such as a vendor's 
catalog, for example, supplied via the Internet or CD-ROM), and the like.

The database thus preferably contains five main types of data: gene information,
compound information, bioassay information, product information, and profile
information. Gene information comprises information specific to each included gene,
and can include, for example, the identity and sequence of the gene, one or more
unique identifiers linked to public and/or commercial databases, its location on a
standard array plate, a list of genes having similar sequences, any known disease 
associations, any known compounds that modulate the encoded protein activity, 
conditions that modulate expression of the gene or modulate the protein activity, and 
the like. Product information comprises information specific to the available
products, and varies depending on the exact nature of the product, and can include
information such as price, manufacturer, contents, warranty information, availability, 
delivery time, distributor, and the like. Bioassay information comprises information 
specific to particular compounds (where available), and can include, for example, 
results from high-throughput screening assays, cellular assays, animal and/or human
studies, biochemical assays (including binding assays and enzymatic assays) and the
like. Compound information comprises information specific to each included 
compound, such as, for example, the chemical name(s) and structure of the 
compound, its molecular weight, solubility and other physical properties, proteins that 
it is known to interact with, the profiles in which it appears, the genes that are affected
by its presence, and available assays for its activity. Profile information includes, for
example, the conditions under which it was generated (including, for example, the cell 
type(s) used, the species used, temperature and culture conditions, compounds 
present, time elapsed, and the like), the genes modulated with reference to a standard, 
a list of similar profiles, and the like. The information is obtained by assimilation of
and/or reference to currently-available databases, and by collecting experimental data.
It should be noted that the gene database, although large, contains a finite number of 
records, limited by the number of genes in the organisms under study. The compound 
database is potentially unlimited, as new compounds are made and tested constantly. 
The profile database, however, is still larger, as it represents information regarding the 
interaction of a very large number of genes with a potentially infinite number of
different compounds, under a variety of conditions. 
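
One way to picture the five types of data and the links between them is the following sketch. The class names, field names, and lookup methods are assumptions made for illustration; the text above does not prescribe a schema, and a production system would more likely use a relational database.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GeneRecord:
    gene_id: str
    sequence: str = ""
    disease_associations: List[str] = field(default_factory=list)


@dataclass
class CompoundRecord:
    compound_id: str
    name: str = ""
    molecular_weight: float = 0.0


@dataclass
class BioassayRecord:
    compound_id: str
    protein_id: str
    assay_type: str
    result: float


@dataclass
class ProductRecord:
    product_id: str
    description: str
    price: float
    related_ids: List[str] = field(default_factory=list)  # genes/compounds it relates to


@dataclass
class ProfileRecord:
    profile_id: str
    compound_id: str
    conditions: Dict[str, str] = field(default_factory=dict)
    expression: Dict[str, float] = field(default_factory=dict)  # gene_id -> level


@dataclass
class Database:
    genes: Dict[str, GeneRecord] = field(default_factory=dict)
    compounds: Dict[str, CompoundRecord] = field(default_factory=dict)
    bioassays: List[BioassayRecord] = field(default_factory=list)
    products: List[ProductRecord] = field(default_factory=list)
    profiles: List[ProfileRecord] = field(default_factory=list)

    def profiles_for_compound(self, compound_id: str) -> List[ProfileRecord]:
        """All stored profiles generated with the given compound."""
        return [p for p in self.profiles if p.compound_id == compound_id]

    def products_related_to(self, record_id: str) -> List[ProductRecord]:
        """Products linked to a gene or compound identifier."""
        return [p for p in self.products if record_id in p.related_ids]
```

Keeping profiles as separate records keyed by compound and conditions reflects the observation that the profile data grows faster than either the gene or the compound data.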

Experimental data is preferably collected using a high-throughput assay 
format, capable of examining, for example, the effects of a plurality of compounds
(preferably a large number of standard compounds, for example 10,000) when
administered individually or as a mixture to a plurality of different cell types. Assay
data collected using a uniform format are more readily comparable, and provide a 
more accurate indication of the differences between, for example, the activity of 
similar compounds, or the differences in sensitivity of similar genes. 

The system provides several different ways to access the information
contained within the database. An operator can enter a test gene expression profile
into the system, cause the system to compare the test profile with stored standard gene 
expression profiles in the database, and obtain an output comprising one or more 
standard expression profiles that are similar to the test profile. The standard
expression profiles are preferably accompanied by annotations, for example providing
information to the operator as to the similarity of the test profile to standard profiles 
obtained from disease states and/or standard compounds. The test gene expression 
profile preferably includes an indication of the conditions under which the profile is 
obtained, for example a representation of a test compound used, and/or the culture
conditions.
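
The query described above, in which a test profile is compared with the stored standard profiles and the similar ones are returned, might be sketched as follows. The scoring rule (counting genes induced past a threshold in both profiles), the default threshold, and all names are assumptions made for the example; any other similarity measure could be substituted.

```python
def shared_induction_count(test, standard, threshold):
    """Count genes whose level exceeds the threshold in both profiles."""
    return sum(
        1
        for gene, level in test.items()
        if level > threshold and standard.get(gene, float("-inf")) > threshold
    )


def find_similar_profiles(test, standards, threshold=1.0, min_score=1, limit=5):
    """Rank stored standard profiles by similarity to a test profile.

    standards maps a profile name to a {gene: level} dictionary.
    Returns up to `limit` (name, score) pairs, best score first,
    with ties broken alphabetically by name.
    """
    scored = [
        (name, shared_induction_count(test, profile, threshold))
        for name, profile in standards.items()
    ]
    hits = [(name, score) for name, score in scored if score >= min_score]
    hits.sort(key=lambda item: (-item[1], item[0]))
    return hits[:limit]
```

Each returned name could then be used to look up the annotations of the matching standard profile, such as the compound or disease state under which it was obtained.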

The output preferably further comprises a list of the genes that are modulated 
(up-regulated or down-regulated) in the test gene expression profile, as compared with 
a pre-established expression value, a pre-selected standard expression profile, a 
second test gene expression profile, or another pre-set threshold value. 

The output is preferably hyperlinked, so that the operator can easily switch
from, for example, a listing of the similar standard expression profiles to a listing of
the modulated genes in a selected standard expression profile, or from a gene listed in 
the test profile to a list of the standard expression profiles in which the gene is 
similarly modulated, or to a list of the standard compounds (and/or conditions) which
appear to modulate the selected gene. The output can comprise correlation
information that highlights features in common between different genes, targets,
profiles, compounds, assays, and the like, to assist the user in drawing useful 
correlations. For example, the output can contain a list of genes that were modulated 
in the user's experiment with a selected compound: if a plurality of the genes are 
indicated as associated with liver toxicity, the system can prompt the user that the
compound is associated with a toxic drug signature, and prompt the user to continue 
with the next compound. Conversely, the output could indicate previously unnoticed 
associations between different pathways, leading the user to explore a hitherto 
unknown connection. The output preferably includes hyperlinks to product information,
encouraging the user to purchase or order one or more products from a selected
vendor, where the product(s) relate specifically to the focus of the database inquiry
and to the correlation information that results and is presented back to the user to
facilitate hypothesis generation. For example, the output can provide links to
products useful for confirming the apparent activity of a compound, for measuring
biological activity directly, for assaying the compound for possible side effects, and 
the like, prompting the user to select products useful in the next stage of 
experimentation. 
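
The liver-toxicity prompt mentioned above can be sketched as a simple annotation check. The annotation format, the message wording, and the choice of two genes as the minimum for "a plurality" are assumptions made for the example.

```python
def toxicity_prompt(modulated_genes, gene_annotations,
                    label="liver toxicity", min_hits=2):
    """Return a warning if enough modulated genes carry the given annotation.

    gene_annotations maps a gene id to a collection of annotation labels.
    Returns None when fewer than `min_hits` modulated genes carry the label.
    """
    hits = sorted(g for g in modulated_genes if label in gene_annotations.get(g, ()))
    if len(hits) >= min_hits:
        return "Warning: {} modulated genes associated with {}: {}".format(
            len(hits), label, ", ".join(hits)
        )
    return None
```

A returned warning could be displayed alongside links to assay products suitable for confirming or ruling out the suspected toxicity.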

The system is preferably provided with an algorithm for assessing similarity of
compounds. Suitable methods for comparing compounds and determining their morphological
similarity include "3D-MT", as set forth in copending application USSN
09/475413, incorporated herein by reference in full, Tanimoto similarity (Daylight 
Software), and the like. Preferably, the system can be queried for any compounds that 
are similar to the test compound in structure and/or morphology. The output from this
query preferably includes the corresponding standard expression profiles (or
hyperlinks to the corresponding standard expression profiles), and preferably further
includes a listing, description, or hyperlink to an assay capable of determining the biological
activity of the standard and/or test compound.
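
For the Tanimoto measure mentioned above, a minimal sketch follows. It assumes that each compound has already been reduced to a structural fingerprint, represented here as the set of "on" bit positions; how fingerprints are generated is outside the sketch, and the function names and the 0.7 cutoff are illustrative assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared bits divided by bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similar_compounds(query_fp, library, cutoff=0.7):
    """Return (name, score) pairs for library compounds at or above the cutoff.

    library maps a compound name to its fingerprint; results are sorted
    by descending score, then by name.
    """
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= cutoff]
    hits.sort(key=lambda item: (-item[1], item[0]))
    return hits
```

The names returned by such a query could then be used to retrieve the corresponding standard expression profiles and related assays.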

Thus, for example, if the user inputs an experimental expression profile
resulting from incubation of test cells with a particular experimental compound, the
user can obtain an output comprising an estimate of the quality of the data, an 
identification of the genes affected by the compound, a listing of similar profiles and 
the conditions under which they were obtained (for example, the compounds used), 
and a list of compounds having a structural similarity. The output can be provided in
a hyperlinked format that permits the user to then investigate and explore the data.
For example, the user can examine which genes are modulated, and determine 
whether or not the genes have yet been characterized as to function or activity, and 
under what conditions each gene is modulated in a similar fashion. Alternatively, the 
user can compare the profile obtained with the profile of a desired outcome, for 
example comparing the profile obtained by incubation of diseased or infected tissue
with a test compound against a profile obtained from healthy (unperturbed) tissue. 
Alternatively, the user can compare the profile with the profiles obtained using 
standard compounds, for example using a drug of known activity, mechanism of 
action, and specificity, thus determining whether the test compound operates by a different
mechanism, or if by the same mechanism whether it is more or less active than
the standard. Additionally, the user can compare the structure of the test compound 
with the structures of other compounds with similar profiles (to determine which 
structural features of the compounds are common, and thus likely to be important for 
activity), or can compare the compound's profile with the profiles obtained from
structurally similar compounds in general. 

The system can be configured as a single, integrated whole, or can be 
distributed over a variety of locations. For example, the system can be provided as a 
central database/server with remotely-located access units. The remote access units 
can be provided with sufficient system capability to accept and interpret test gene
expression profiles, and to compare the test profiles with standard gene expression
profiles. Remote units can further be provided with a copy of some or all of the
database information. Optionally, the remote system can be used to upload test gene
expression profiles to the central system to update the central database, or a "private"
database supplementary to the main database can be stored in or near the remote unit.

Further, the system can be divided into "vendor" and "client" portions,
separating segments of the system into any economically useful subsets, in which 
interaction between a vendor unit and a client unit is monitored and/or governed by 
the client's state. For example, the system can be configured to treat a primary 
database as a vendor unit, and remote access units as client units. The vendor
database can be configured to respond to a plurality of different permission levels,
wherein lower permission levels are granted access to only a restricted subset of the 
available data, with successively higher levels obtaining access to greater amounts of 
data. For example, the lowest permission level can provide access only to publicly-available
gene sequences and public annotations, without correlations to compounds
or profiles. The client system in such cases can be equipped to provide statistical
analysis of the profile generated by the user, the ability to identify genes within the
profile, and the ability to compare gene sequences for similarity. In this case, the 
interaction between client unit and vendor unit can be limited to access to the 
publicly-available gene sequences, which can be provided electronically, or
exchanged via a storage medium (for example, using CD-ROM, DVD, or the like). 
The bulk of the vendor database (for this permission level) can be pre-installed at the 
client location, avoiding the need to download large amounts of data (for example, 
limiting downloads only to updates). This level can be essentially unrestricted, i.e.,
allowing public access without need for a pre-existing vendor-client relationship. 

An intermediate permission level can provide access to a larger subset of data, 
for example including links to some or all of the available profile and compound data 
in addition to the information provided to the lower permission level. In this case, the 
interaction between client and vendor systems occurs contemporaneously with or after a
client account is established, determining the level of access to be granted the client. 
If conducted electronically, the interaction is preferably accomplished through means 
of a secure transaction, to ensure that neither the vendor data nor the client queries are 
rendered non-confidential. Such transactions can be conducted, for example, by
adapting the systems and methods disclosed in US 5,724,424, incorporated herein by
reference in full. The data in this case can be limited to compounds that are publicly 
known (for example, commercially available, or disclosed in patents or the like) and 
profile data related to those compounds. Alternatively, the system can be arranged so 
that the client obtains access only to a specific field, for example, profiles related to
diabetic conditions, autoimmune conditions, cancer, and the like. For cases of
intermediate permission, the vendor system can filter output before it is transmitted to
the client system, to ensure that only the permitted degree of information is
distributed. The vendor system can also filter input, to ensure that vendor system
resources are not consumed in preparing answers that cannot be delivered to the client
system.

At the penultimate permission level, the client is granted access to all data in 
the database except for data that is proprietary, restricted, or exclusively granted to 
another client. The ultimate permission level may be available only to the vendor 
itself, or can be made available to one or more clients if no exclusivity is granted to 
clients.
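
The tiered permission scheme can be sketched as an output filter applied on the vendor side before results are transmitted. The level names, the record fields `min_level` and `exclusive_to`, and the function name are assumptions made for illustration only.

```python
from enum import IntEnum


class Permission(IntEnum):
    PUBLIC = 0        # public sequences and annotations only
    INTERMEDIATE = 1  # adds publicly known compound and profile data
    PENULTIMATE = 2   # everything except data reserved for another client
    FULL = 3          # vendor-level access


def visible_records(records, level, client_id):
    """Filter records before transmission to a client.

    Each record is a dict with 'min_level' (a Permission) and an optional
    'exclusive_to' client id; both field names are hypothetical.
    """
    allowed = []
    for rec in records:
        if level < rec["min_level"]:
            continue  # client's level is too low for this record
        owner = rec.get("exclusive_to")
        if owner is not None and owner != client_id and level < Permission.FULL:
            continue  # record is reserved for a different client
        allowed.append(rec)
    return allowed
```

The same check could be applied to incoming queries, so that the vendor system does not spend resources preparing answers it would later withhold.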

Additionally, the system can include provisions for accepting new data from a 
remote client, for example, to enable a user to store his or her own data on the vendor 
server. Access to such client data can be restricted to only the same client, or can be 
made available to all clients or a subset thereof (for example, in exchange for a credit
or other privilege). 

Fig. 1 illustrates a system of the invention, comprising vendor server 10 
containing vendor database 12. Vendor database 12 in turn contains a genomic 
database 14, a compound database 16, and a profile database 18, which in turn contain
optional private (user) databases 15, 17, and 19. Alternatively, the private databases 
can be physically located outside the vendor databases, for example, elsewhere within 
the vendor system or maintained in parallel within the user's site. The vendor 
databases can further comprise a product database 30 maintained within the vendor 
system, and/or an external product database 32 linked to the vendor system. The
product databases can contain information regarding products available from the 
vendor, a third-party vendor, or both. One or both of the product databases can 
further comprise user-specific data (31, 33) such as, for example, user account 
information (account number, format preferences, shipping addresses, prior order 
history, authorization level, and the like), the user's notes or annotations regarding
particular products, and the like. The product databases are preferably provided with 
hyperlinks that facilitate user purchases of the products displayed. The vendor system 
is connected to a plurality of user systems 50, 51, 52, which in turn contain individual 
user databases 55, 56, 57. The user systems can communicate with the vendor system 
by any convenient medium, including, without limitation, direct connection,
distributed network (LAN or WAN), internet connection, virtual private network 
(VPN), direct dial-in, and the like. The hardware employed for use in the method of 
the invention can comprise general-purpose computers, for example currently
available personal computers and workstations, or special-purpose terminals designed
for this application.
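
The arrangement of Fig. 1 can be pictured, purely as a sketch, as three stores (genomic, compound, and profile) each holding public records together with optional private areas keyed by user. The class names `Store` and `VendorDatabase`, the record layout, and the user identifiers are assumptions made for illustration; the specification does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Store:
    """One vendor database plus optional private (user) areas (hypothetical layout)."""
    public: Dict[str, dict] = field(default_factory=dict)
    private: Dict[str, Dict[str, dict]] = field(default_factory=dict)  # user -> records

    def add(self, key: str, record: dict, owner: Optional[str] = None) -> None:
        """Store a record publicly, or in the private area of `owner` if given."""
        target = self.public if owner is None else self.private.setdefault(owner, {})
        target[key] = record

    def visible_to(self, user: str) -> Dict[str, dict]:
        """Return the public records merged with the user's own private records."""
        merged = dict(self.public)
        merged.update(self.private.get(user, {}))
        return merged


@dataclass
class VendorDatabase:
    """Vendor database 12 of Fig. 1, reduced to its main parts."""
    genomic: Store = field(default_factory=Store)    # database 14, private area 15
    compound: Store = field(default_factory=Store)   # database 16, private area 17
    profile: Store = field(default_factory=Store)    # database 18, private area 19
    products: Dict[str, dict] = field(default_factory=dict)  # product database 30
```

Keeping the private areas inside each store mirrors the first alternative described above; locating them elsewhere in the vendor system or at the user's site would change only where `private` is held.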

Fig. 2 illustrates a simple flow diagram for an embodiment of the invention. 
The user may begin by uploading data into the system 200 (or otherwise acquiring 
profile data), or alternatively may simply begin by browsing 205 for a gene, 
compound, or profile of interest already present in the system. If new data is added, 
the data can optionally be evaluated and validated 210. Optionally, the new data can
be uploaded to the primary database, as either a public or private addition, or can be 
stored in the user portion of the system 215. After data validation (if any), the data is 
examined by the system, and the genes and profile identified 220. This result is 
displayed 230, along with hyperlinks to related product information. Preferably, the 
results are displayed in a manner that highlights correlations between similar
expression profiles, the profiles of similar compounds, the profiles of related genes, 
and the like. The user can then select more information regarding one or more related 
compounds 231, genes 233, profiles 235, and the like, at which point the system can 
display relevant compound products 232, relevant clones and/or bioassay products
234, or relevant array products 236. The output display preferably facilitates selection 
of relevant products by the user, flagging selected products 240 (for example, adding 
them to a "shopping cart" system). The user can then select 245 a path of inquiry, and 
search for compounds of similar structure, morphology, or activity (in terms of 
profile), for selected genes or genes of similar sequence or known function, or for
similar profiles 205. These results are displayed 230, and the user is invited to continue
browsing until finished. Alternatively, the user can pre-select various forms of output,
for example, selecting to have the initial data display include a listing of similar
compounds linked to displays of their profiles, or a listing of the experimental profile
along with a list of similar profiles ranked by degree of similarity. Alternatively, the
user can upload a chemical structure (whether real or hypothetical), and obtain a 
display of a predicted profile extrapolated from the profiles of morphologically 
similar compounds. 
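
As a minimal sketch of the two operations just described, profiles can be ranked by degree of similarity and a profile can be predicted for a new structure from the profiles of similar compounds. The specification does not name a similarity measure; Pearson correlation between expression vectors and a Tanimoto coefficient over sets of structural features are assumed here solely for illustration, and all function names are hypothetical.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)


def rank_similar(query: Sequence[float],
                 library: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """List library profiles ranked by decreasing similarity to the query."""
    scored = [(name, pearson(query, prof)) for name, prof in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient between two sets of structural features."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_profile(features: FrozenSet[str],
                    known: Dict[str, Tuple[FrozenSet[str], Sequence[float]]]
                    ) -> List[float]:
    """Extrapolate a profile as the similarity-weighted mean of known profiles."""
    weighted = [(tanimoto(features, f), p) for f, p in known.values()]
    total = sum(w for w, _ in weighted)
    if total == 0:
        raise ValueError("no structurally similar compound available")
    length = len(weighted[0][1])
    return [sum(w * p[i] for w, p in weighted) / total for i in range(length)]
```

Other measures (for example, rank correlation or Euclidean distance) could be substituted without changing the overall flow of steps 205 and 230.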

These methods can be conducted on a single computer, or can be distributed 
over a plurality of computers. For example, steps 200, 205 and 230 can occur on a
remote computer (at the user site), while other steps occur on a local computer or 
computers, or at another remote site distinct from the user's site (the vendor server). 

Data concerning experimental pharmaceutical compounds and their biological 
activity are extremely sensitive, valuable and confidential. In embodiments that 
include computers or other hardware at a plurality of locations, it is presently
preferred to include some provision for security, for example by regulating access or 
by means of encrypted commands and results. Suitable methods are known in the art, 
including, for example, public key encryption and SSL (secure socket layer) 
connections. Alternatively, rather than reporting gene expression data in terms of 
absolute expression, one can report the data in terms of differences from a given
standard. Thus, if gene "A" has an arbitrary standard expression value of 56 (in 
arbitrary units), and in an experimental profile gene "A" is expressed at a level of 97, 
the data for gene "A" can be reported as expression of 41 rather than 97. A different 
standard level can be established for each gene employed, essentially forming an
encoding profile. A plurality of different encoding profiles can be established and
enumerated for each user and shared by secure means, with the user and vendor 
simply indicating which profile (by number) is used for each transmission. Further, 
one can express the data in terms of other arithmetic functions and combinations of 
functions of an encoding profile, as long as the original data can be unambiguously
retrieved by the authorized party. For example, the encoding transform for a 
particular encoding profile can specify that data for the first gene is expressed as the 
difference between the experimental and profile values, while data for the next gene is 
expressed as a percentage of the profile value, while data for the third gene is
expressed as the difference between the third experimental value and the second
experimental value, and the like. If additional security is desired, one can establish 
encoding profiles and transforms that change depending on other parameters, for 
example by date, by user number, by time of file modification, by number of data 
sets, and the like, and combinations thereof. Alternatively, one can specify a large
number of available encoding profiles, and specify in advance a random sequence of
profiles to employ, avoiding the identification of any profile during transmission of 






WO 02/31704 



PCT/US01/32016 
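
The encoding-profile scheme described in the specification above can be sketched as follows, using the worked values given there (a standard value of 56 and an experimental value of 97 for gene "A", reported as 41). The class `EncodingProfile`, the transform labels, and the functions `encode` and `decode` are hypothetical names introduced for illustration; only the three transforms named in the text (difference from the profile value, percentage of the profile value, and difference from the preceding experimental value) are modeled.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical labels for the three transforms named in the text.
DIFF = "diff"        # experimental value minus the profile (standard) value
PERCENT = "percent"  # experimental value as a percentage of the profile value
PREV_DIFF = "prev"   # experimental value minus the preceding experimental value


@dataclass(frozen=True)
class EncodingProfile:
    standards: Tuple[float, ...]   # one standard expression level per gene
    transforms: Tuple[str, ...]    # one transform label per gene


def encode(values: Sequence[float], profile: EncodingProfile) -> List[float]:
    """Express experimental values relative to an encoding profile."""
    out: List[float] = []
    for i, (v, std, kind) in enumerate(
            zip(values, profile.standards, profile.transforms)):
        if kind == DIFF:
            out.append(v - std)
        elif kind == PERCENT:
            out.append(100.0 * v / std)      # assumes a nonzero standard value
        elif kind == PREV_DIFF and i > 0:
            out.append(v - values[i - 1])
        else:
            raise ValueError(f"unsupported transform {kind!r} at position {i}")
    return out


def decode(encoded: Sequence[float], profile: EncodingProfile) -> List[float]:
    """Recover the original values; only a holder of the profile can do so."""
    values: List[float] = []
    for i, (e, std, kind) in enumerate(
            zip(encoded, profile.standards, profile.transforms)):
        if kind == DIFF:
            values.append(e + std)
        elif kind == PERCENT:
            values.append(e * std / 100.0)
        elif kind == PREV_DIFF and i > 0:
            values.append(e + values[i - 1])  # preceding value is already decoded
        else:
            raise ValueError(f"unsupported transform {kind!r} at position {i}")
    return values
```

In use, the vendor and user would hold a shared, numbered table of such profiles and would transmit only the profile number with each encoded data set.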



What is claimed : 

1. A method for facilitating exploration of biological and chemical data,
comprising:

a) providing a database comprising a plurality of standard gene 
expression profiles, each profile comprising a representation of the expression 
level of a plurality of genes in a cell exposed to a standard compound and a 
representation of the standard compound; 
b) displaying a selected gene expression profile;

c) displaying correlation information related to said gene expression 
profile to facilitate generation of a hypothesis; and 

d) displaying relevant product information to facilitate testing said 
hypothesis. 


2. The method of claim 1, wherein said correlation information is
selected from the group consisting of: identification of a profile similar to said gene
expression profile, identification of a compound that produces a similar profile,
identification of a gene modulated in said profile, identification of a disease or
disorder in which a plurality of the same genes are modulated in a similar fashion,
identification of compounds having similar physical and chemical properties as the
compound used to generate the profile, identification of compounds having similar
shapes, identification of compounds having similar biological activities,
identification of a gene or protein having sequence similarity to a selected gene or
protein, identification of a gene or protein having a similar known function or activity,
identification of a gene or protein subject to modulation or control by the same
compound, identification of a gene or protein that belongs to the same metabolic or
signal pathway, and identification of a gene or protein belonging to similar metabolic
or signal pathways.


3. The method of claim 1, wherein said relevant product information is
selected from the group consisting of: information regarding a bioassay reagent useful
for measuring activity of an identified enzyme, information regarding a compound
useful as a positive control, information regarding a compound useful as a negative
control, information regarding a kit for purifying an identified protein, information
regarding antibodies for determining and/or isolating substances, information
regarding a compound similar to the test compound useful for further study,
additional data regarding gene or protein function and/or relationships, sequence data from
other species, information regarding metabolic and/or signal pathways to which the
gene or protein belongs, information regarding a DNA microarray useful for
determining expression of the gene and/or related genes, and information and analysis
regarding features of a compound that are likely to be responsible for the observed
activity.


4. The method of claim 3, wherein said product information further
comprises a hyperlink that facilitates direct purchase of said product.

5. The method of claim 1, wherein said database further comprises drug
signatures for a plurality of compounds, wherein each said drug signature comprises a
representation of the physical and chemical characteristics of each compound, data 
regarding the effect of each compound on the transcription of a plurality of genes, and 
data regarding the effect of each compound on a plurality of proteins. 

6. The method of claim 1, wherein said gene expression profile is
selected on the basis of its similarity to an experimental expression profile provided 
by the user. 

7. A method for facilitating exploration of biological and chemical data, 
comprising:

a) providing a database comprising drug signatures for a plurality of
compounds, wherein each said drug signature comprises a representation of the physical
and chemical characteristics of each compound, data regarding the effect of
each compound on the transcription of a plurality of genes, and data regarding
the effect of each compound on a plurality of proteins;

b) displaying a selected drug signature; 

c) displaying correlation information related to said drug signature to 
facilitate generation of a hypothesis; and 
d) displaying relevant product information to facilitate testing said
hypothesis. 

8. The method of claim 7, wherein said relevant product information is
selected from the group consisting of: information regarding a bioassay reagent useful
for measuring activity of an identified enzyme, information regarding a compound
useful as a positive control, information regarding a compound useful as a negative
control, information regarding a kit for purifying an identified protein, information
regarding antibodies for determining and/or isolating substances, information
regarding a compound similar to the test compound useful for further study,
additional data regarding gene or protein function and/or relationships, sequence data from
other species, information regarding metabolic and/or signal pathways to which the
gene or protein belongs, information regarding a DNA microarray useful for
determining expression of the gene and/or related genes, and information and analysis
regarding features of a compound that are likely to be responsible for the observed
activity.

9. The method of claim 8, wherein said product information further
comprises a hyperlink that facilitates direct purchase of said product.


10. A system for facilitating exploration of biological and chemical data, 
comprising: 

a database comprising a plurality of standard gene expression profiles, each 
profile comprising a representation of the expression level of a plurality of genes in a 
cell exposed to a standard compound and a representation of the standard compound;
input means for accepting data and user selections; 
selection means for selecting a gene expression profile; 
correlation selection means for identifying correlation information related to 
said gene expression profile; 
product information selection means for selecting information regarding
relevant products related to said gene expression profile; and 

display means for displaying information regarding said gene expression
profile. 
11. The system of claim 10, wherein said database further comprises drug
signatures for a plurality of compounds, wherein each said drug signature comprises a 
representation of the physical and chemical characteristics of each compound, data 
regarding the effect of each compound on the transcription of a plurality of genes, and
data regarding the effect of each compound on a plurality of proteins.

12. A system for facilitating exploration of biological and chemical data, 
comprising: 

a database comprising drug signatures for a plurality of compounds, wherein 
each said drug signature comprises a representation of the physical and chemical
characteristics of each compound, data regarding the effect of each compound on the 
transcription of a plurality of genes, and data regarding the effect of each compound 
on a plurality of proteins; 

input means for accepting data and user selections; 
selection means for selecting a gene expression profile;

correlation selection means for identifying correlation information related to 
said gene expression profile; 

product information selection means for selecting information regarding 
relevant products related to said gene expression profile; and 
display means for displaying information regarding said gene expression
profile. 